Find this repository: https://github.com/libjohn/workshop_textmining
Much of this review comes from the site: https://juliasilge.github.io/tidytext/
The primary package, tidytext, enables all kinds of text mining. See also this helpful free online book: Text Mining with R: A Tidy Approach by Silge and Robinson.
library(janeaustenr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(wordcloud2)
library(textdata)
Data
We’ll look at some books by Jane Austen, a novelist of the late 18th and early 19th centuries. Austen explored women and marriage within the British upper class. The novelist has a unique and well-earned following within literature. Her work is consistently discussed and honored. To this day, Austen’s novels are the source of many adaptations, written and on-screen. Through the janeaustenr package we can access and mine the text of six Austen novels. We can call the collection of novels a corpus (plural: corpora); an individual novel is a document within that corpus.
austen_books()
Austen is best known for six published works:
austen_books() %>%
  distinct(book)
Data Cleaning
Text mining typically requires a lot of data cleaning. In this case, we start with the janeaustenr collection, which has already been cleaned. Nonetheless, further data wrangling is required. First, we identify a line number for each line of text in each book.
Identify line numbers
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(line = row_number()) %>%  # identify line numbers
  ungroup()

original_books
Tokens
To work with these data as a tidy dataset, we need to restructure the data through tokenization. In our case a token is a single word. We want one-token-per-row. The unnest_tokens() function (tidytext package) will convert a data frame with a text column into the one-token-per-row format.
The default tokenizing mode is “words”. With the unnest_tokens() function, tokens can be: words, characters, character_shingles, ngrams, skip_ngrams, sentences, lines, paragraphs, regex, tweets, and ptb (Penn Treebank).
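As a minimal base-R sketch of the one-token-per-row shape that unnest_tokens() produces (hand-rolled here with strsplit() so it runs without any packages; unnest_tokens() additionally strips punctuation):

```r
# Two lines of text, split into one lowercase word per row,
# each word tagged with the line it came from
text <- c("It is a truth universally acknowledged",
          "that a single man in possession of a good fortune")
words_per_line <- strsplit(text, " ")
tokens <- data.frame(
  line = rep(seq_along(text), lengths(words_per_line)),
  word = tolower(unlist(words_per_line))
)
head(tokens, 3)  # one token per row: "it", "is", "a" on line 1
```

This long ("tidy") shape is what lets the rest of the pipeline use ordinary dplyr verbs like count() and inner_join().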
You can customize stop-words data frames, sentiment data frames, etc.
There are various stop-words dictionaries. Here we add the stop word “farfegnugen” to a custom dictionary. If Jane Austen ever used the word “farfegnugen,” that would be weird, or bad. So we will take pains not to calculate the sentiment of that word, whether or not the term shows up in a sentiment dictionary. That is, we will remove the word by adding it to a customized stop-words dictionary.
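The usual tidytext idiom is bind_rows() to extend a stop-words data frame, then anti_join() to drop those words from the tokens. A small sketch, using a tiny stand-in data frame in place of tidytext's built-in stop_words (which has the same word/lexicon columns):

```r
library(dplyr)
library(tibble)

# stand-in for tidytext's stop_words data frame
stop_words_demo <- tibble(word = c("the", "and", "of"), lexicon = "demo")

# add the custom stop word to the dictionary
custom_stop_words <- bind_rows(stop_words_demo,
                               tibble(word = "farfegnugen", lexicon = "custom"))

# one-token-per-row data, then drop anything found in the dictionary
tokens <- tibble(word = c("emma", "farfegnugen", "the", "handsome"))
tokens %>%
  anti_join(custom_stop_words, by = "word")  # keeps "emma", "handsome"
```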
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Calculate sentiment
Algorithm: sentiment = positive - negative
Define a section of text.
“Small sections of text may not have enough words in them to get a good estimate of sentiment while really large sections can wash out narrative structure. For these books, using 80 lines works well, but this can vary depending on individual texts…” – Text Mining with R
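Concretely, integer division (`%/%`) by 80 maps each line number to a section index, and the sentiment of each section is then just positive count minus negative count. A tiny worked example with made-up counts:

```r
# `%/%` assigns each line to an 80-line section
c(1, 79, 80, 159, 160) %/% 80  # -> 0 0 1 1 2

# per-section word counts (invented numbers, three sections)
positive <- c(12, 7, 15)
negative <- c(4, 9, 5)
sentiment <- positive - negative  # the algorithm above
sentiment                         # -> 8 -2 10
```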
bing <- get_sentiments("bing")

janeaustensentiment <- tidy_books %>%
  inner_join(bing, by = "word", relationship = "many-to-many") %>%
  count(book, index = line %/% 80, sentiment) %>%  # `%/%` = int division; 80 lines / section
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%  # spread(sentiment, n, fill = 0)
  mutate(sentiment = positive - negative)  # ALGO!!!
janeaustensentiment
Viz it
janeaustensentiment %>%
  ggplot(aes(index, sentiment)) +
  geom_col(show.legend = FALSE, fill = "cadetblue") +
  geom_col(data = . %>% filter(sentiment < 0), show.legend = FALSE, fill = "firebrick") +
  geom_hline(yintercept = 0, color = "goldenrod") +
  facet_wrap(~ book, ncol = 2, scales = "free_x")
Preparation: Most common positive and negative words
bing_word_counts
Viz it too
bing_word_counts %>%
  filter(n > 170) %>%
  mutate(n = if_else(sentiment == "negative", -n, n)) %>%
  ggplot(aes(fct_reorder(str_to_title(word), n), n, fill = str_to_title(sentiment))) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(type = "qual") +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(title = "Frequency of popular positive and negative words",
       subtitle = "Jane Austen novels",
       y = "Compound sentiment score", x = "",
       fill = "Sentiment",
       caption = "Source: library(janeaustenr)") +
  theme(plot.title.position = "plot")
Dictionaries
What other dictionaries are available? How to choose?
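tidytext's get_sentiments() offers several lexicons: "bing" (binary positive/negative labels), "afinn" (integer scores from -5 to +5, via the textdata package), "nrc" (emotion categories), and "loughran" (finance-oriented). The choice matters because a binary lexicon weights every word equally while a scored one captures intensity. A hand-made comparison (invented scores, not the real lexicon values):

```r
# three words scored two ways
words <- c("miserable", "good", "outstanding")
bing_style  <- c(-1, 1, 1)  # binary: each word counts the same
afinn_style <- c(-3, 3, 5)  # scored: intensity matters
sum(bing_style)   # -> 1
sum(afinn_style)  # -> 5
```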
emma_afinn
emma_afinn %>%
  count(word, sort = TRUE)
Make Sections
Just as we calculated sentiment above, we make sections of 80 words each and then calculate a sentiment score for each section.
emma_afinn_sentiment <- emma_afinn %>%
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>%
  group_by(index) %>%
  summarise(sentiment = sum(value))  # ALGO: sum each AFINN score in the 80-word section

emma_afinn_sentiment
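The same per-section summing can be checked with base R alone. Hypothetical AFINN-style scores for nine consecutive words, using 3-word sections so the arithmetic is easy to verify by eye (the chapter uses 80-word sections):

```r
# invented word scores; tapply sums them within each section index
value <- c(2, -1, 3, -4, 1, 1, 5, -2, 0)
index <- (seq_along(value) - 1) %/% 3
tapply(value, index, sum)  # section sums: 4 -2 3
```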